Udemy: Python for Data Science and Machine Learning Bootcamp
Study material stored @ OneDrive/Documents/Study Documents/Online courses
Section 5. Numpy arrays
- numpy arrays: vector vs. matrix
- `np.arange`: uses a step size vs. `np.linspace`: uses the number of points as input
- `np.eye(n)`: creates an n × n diagonal matrix, i.e. an identity matrix
- `np.random`
  - `np.random.rand`: Uniform(0, 1)
  - `np.random.randn`: $N(0,1)$
  - `np.random.randint`: random integers between the low and high inputs
- the `reshape` method of an array; the `shape` attribute of an array
- `.max()`/`.min()`: return the value; `.argmax()`/`.argmin()`: return the index of that value
  - Compared with the `.argX()` functions, pandas' `idxmax()`/`idxmin()` (which return the index label) may be preferred
- `array.dtype`: the data type of the array's elements
- numpy array indexing
- array slicing => not a copy of the sliced part but just a view (pointer) into the original array
- to make a real copy, use the `array.copy()` method
- `np.array()` can be used to create an array instance
- indexing a 2-d array: if only one index is provided, it refers to the row number, e.g. `array[0]` -> first row
- to get a single element, double brackets, i.e. `array[0][1]`, work; BUT a single bracket with a comma, i.e. `array[0, 1]`, is preferred
- Universal functions: ufuncs
- broadcasting (see the sketch below)
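A minimal runnable sketch tying the section together (variable names are mine, not from the course):

```python
import numpy as np

arr = np.arange(0, 10)               # step-based: integers 0..9
lin = np.linspace(0, 10, 5)          # length-based: 5 evenly spaced points
eye = np.eye(3)                      # 3 x 3 identity matrix

u = np.random.rand(4)                # Uniform(0, 1)
g = np.random.randn(4)               # N(0, 1)
ints = np.random.randint(0, 100, 5)  # 5 random ints in [0, 100)

grid = arr.reshape(2, 5)             # grid.shape == (2, 5)
print(grid.max(), grid.argmax())     # max value vs. its (flattened) index

view = arr[0:5]                      # slicing gives a view, not a copy
view[:] = 99                         # ...so this also modifies arr
safe = arr.copy()                    # a real, independent copy

print(grid[0])                       # one index on a 2-d array -> first row
print(grid[0, 1])                    # single element; preferred over grid[0][1]

print(arr + 100)                     # broadcasting: scalar applied element-wise
```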
Section 6. Pandas
- DataFrames - Part 1
- `df.drop(axis=0/1, inplace=True)`; without `inplace=True`, pandas just makes a copy of the DF after dropping, and the original DF is unchanged
- Indexing a DF: `df.loc["row index name"]`, `df.iloc[row numeric position]`
- DataFrames - Part 2
- ⭐ Python's `and` keyword works only for scalar boolean objects, not for arrays of booleans; use `&` for arrays
- reset index: `df.reset_index()`, original index => a column; not in place by default
- set index: `df.set_index(column name)`
- DataFrames - Part 3
- `pd.MultiIndex.from_tuples(list of tuples)`
- `df.xs(index value, level=index name)`: cross-section method
- Missing data
- `df.dropna()` (default `axis=0`): drop any row with `NaN` values
  - `thresh=int`: keep only the rows/columns with at least that many non-`NaN` values
- `df.fillna(value=...)`
- Groupby
- `df.groupby(column name).mean()`
- ⭐ `df.groupby().describe()`
- Merging, joining, and concatenating
- `pd.concat([df1, df2, df3, ...])`, `axis=0` by default => stacks the rows
  - if columns/indexes don't align, the result will contain some `NaN` values
- `pd.merge(how="left/right/inner", etc., on=column name)`
- `df.join()`: similar to merge but uses the index rather than a column for matching (sketch below)
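A quick sketch of the three combiners (toy frames, my naming):

```python
import pandas as pd

df1 = pd.DataFrame({"key": ["a", "b"], "x": [1, 2]})
df2 = pd.DataFrame({"key": ["a", "c"], "y": [3, 4]})

stacked = pd.concat([df1, df2], axis=0)        # note: takes a *list* of DataFrames
# columns that don't align ("x" vs. "y") are filled with NaN

merged = pd.merge(df1, df2, how="inner", on="key")        # SQL-style join on a column

joined = df1.set_index("key").join(df2.set_index("key"))  # join matches on the index
print(stacked, merged, joined, sep="\n\n")
```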
- Operations
- find unique values: `df[col].unique()`
- find the frequency of each unique value in a column: `df[col].value_counts()`
- ⭐ apply method: `df[col].apply(fun)`
- `df.sort_values(by=col_name)`
- `df.isnull()`
- pivot table: `df.pivot_table(values=, index=, columns=)`
- Data I/O
- .csv: `pd.read_csv()`, `df.to_csv("name", index=False)`
- Excel: `xlrd` module; `pd.read_excel("file.xlsx", sheet_name=" ")`, `df.to_excel("", sheet_name=" ")`
- HTML: `lxml`/`html5lib`/`BeautifulSoup4`; `pd.read_html("xxx.html")`
- SQL: `sqlalchemy` => `create_engine`
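A small I/O sketch, assuming an in-memory SQLite database via `sqlalchemy` (file and table names are illustrative):

```python
import pandas as pd
from sqlalchemy import create_engine

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

df.to_csv("example.csv", index=False)   # index=False keeps the index out of the file
back = pd.read_csv("example.csv")

engine = create_engine("sqlite:///:memory:")   # throwaway in-memory SQL database
df.to_sql("my_table", engine, index=False)
sql_df = pd.read_sql("my_table", con=engine)
```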
Section 8. Python for data visualization - matplotlib
42-44. matplotlib Parts 1 - 3
- `%matplotlib inline`: used in Jupyter notebooks; otherwise need to call `plt.show()` every time
- Functional method: `plt.plot(x, y)`
- OO method:

  ```python
  fig = plt.figure()                         # figure -> canvas
  axes = fig.add_axes([0.1, 0.1, 0.8, 0.8])  # axes-level plot (the rect argument is required)
  axes.plot(x, y)
  ```
- Note: `plt.subplot()` $\ne$ `plt.subplots()`; the second creates an array of axes objects, like doing multiple `fig.add_axes()` calls: `fig, axes = plt.subplots()` (see the sketch below)
- `plt.tight_layout()`: better displays multiple subplots
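A runnable sketch contrasting the two approaches (data is made up):

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 5, 11)
y = x ** 2

# functional method
plt.plot(x, y)
plt.show()

# OO method: plt.subplots() hands back a figure plus an array of axes
fig, axes = plt.subplots(nrows=1, ncols=2)
axes[0].plot(x, y)
axes[0].set_title("y = x^2")
axes[1].plot(y, x)
axes[1].set_title("flipped")
plt.tight_layout()   # keeps the two subplots from overlapping
plt.show()
```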
Section 9. Python for data visualization - seaborn
- `distplot`: histogram
- `jointplot`: scatter plot
- `pairplot`: scatter plot matrix
- `rugplot`: density plot -> `kdeplot`
- Categorical plots
  - `barplot`: x = cat, y = continuous
  - `countplot`: x = cat, y = count of occurrences
  - `boxplot`: x = cat, y = continuous
  - `violinplot`: `hue`, `split=True`
  - `stripplot`: `jitter=True`
  - `swarmplot`: `stripplot` + `violinplot`
  - `factorplot`: can do all of the above by specifying `kind=...`
- Matrix plots
  - need the matrix form of the data
  - `heatmap(matrix_df)`
  - `clustermap`: clustered version of `heatmap`
- Grids
  - `g = sns.PairGrid(df)`; then `g.map_diag()`/`g.map_upper()`/`g.map_lower()`
  - `FacetGrid`: creates subgroups for plotting
- Regression plots
  - `lmplot()`: scatter plot with a regression line
- Style & color
  - `sns.set_style()`
  - `sns.despine()`, inputs: top, right, bottom, left
  - `sns.set_context("poster", font_scale=x)`
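A short sketch using seaborn's built-in `tips` dataset (assumes the course-era API; `distplot` is deprecated in newer seaborn):

```python
import seaborn as sns
import matplotlib.pyplot as plt

tips = sns.load_dataset("tips")                    # seaborn's built-in demo data

sns.set_style("whitegrid")
sns.distplot(tips["total_bill"])                   # histogram + KDE
plt.show()

sns.boxplot(x="day", y="total_bill", data=tips)    # categorical vs. continuous
sns.despine()                                      # removes top/right spines
plt.show()

sns.heatmap(tips[["total_bill", "tip", "size"]].corr())  # matrix-form data
plt.show()
```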
Section 10. Python for data visualization - Pandas built-in data visualization
- Pandas built-in data visualization
  - `df['col'].plot(kind="...", ...)`
  - `df['col'].plot.hist()`
  - `df.plot.<kind>()`
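A tiny sketch with random data (column names are mine):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(100, 2), columns=["a", "b"])

df["a"].plot.hist(bins=30)       # same as df["a"].plot(kind="hist", bins=30)
df.plot.scatter(x="a", y="b")    # <kind> can be hist, box, scatter, area, ...
```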
Section 11. Python for data visualization - plotly & cufflinks
- plotly & cufflinks
- Cufflinks: toolbox that links plotly & pandas
- plotly is free to use for all functions, but you need to pay if you'd like to save plots online
- `from plotly import __version__`
- `from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot`
- `init_notebook_mode(connected=True)`
- use plotly: `df.iplot()`
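A sketch of the offline workflow (assumes `cufflinks` is installed; `cf.go_offline()` switches it to the free offline mode):

```python
import numpy as np
import pandas as pd
import cufflinks as cf
from plotly.offline import init_notebook_mode

init_notebook_mode(connected=True)   # render plotly inside the notebook
cf.go_offline()                      # point cufflinks at offline plotting

df = pd.DataFrame(np.random.randn(100, 3), columns=["a", "b", "c"])
df.iplot()                                             # interactive line plot
df.iplot(kind="scatter", x="a", y="b", mode="markers")
```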
Section 13. Data capstone project
- `.groupby().unstack()`?
- `.groupby().reset_index()` -> `FacetGrid()`
Section 18. K-nearest neighbors
- In KNN, all variables need to be on the same scale; otherwise some variables may dominate the distance calculation
- Find scikit-learn cheatsheet (multiple)
- Tuning parameters
- `n_neighbors`
- Distance metric
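A minimal end-to-end KNN sketch on synthetic data (dataset and parameter values are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report

X, y = make_classification(n_samples=300, n_features=4, random_state=101)

# scale first, so no single feature dominates the distance calculation
X_scaled = StandardScaler().fit_transform(X)

X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.3, random_state=101)

knn = KNeighborsClassifier(n_neighbors=5)   # n_neighbors is the main tuning knob
knn.fit(X_train, y_train)
print(classification_report(y_test, knn.predict(X_test)))
```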
Section 19. Decision trees & random forest
Section 20. SVM
- `from sklearn.svm import SVC`
- Grid search (sketch below):
  - `from sklearn.model_selection import GridSearchCV`
  - `grid = GridSearchCV(SVC(), param_grid, refit=True, verbose=3)`
  - `grid.fit(X_train, y_train)`
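Putting the grid search together (dataset choice and grid values are mine, not the course's):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=101)

# candidate values for C (regularization) and gamma (RBF kernel width)
param_grid = {"C": [0.1, 1, 10, 100], "gamma": [1, 0.1, 0.01, 0.001]}

grid = GridSearchCV(SVC(), param_grid, refit=True, verbose=3)
grid.fit(X_train, y_train)            # grid.fit(...), not grid_fit(...)
print(grid.best_params_)
print(grid.score(X_test, y_test))     # refit=True => scored with the best model
```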
Section 21. K-means clustering (unsupervised)
- Finding K:
- "elbow" method, using SSE: sum of the squared distance between each member of the clusters and its centroid
- In `sklearn.datasets`, can use `make_blobs` to generate fake cluster data
- After fitting KMeans to the data, can retrieve the centers from `kmeans.cluster_centers_` and the cluster labels from `kmeans.labels_` (sketch below)
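A sketch with fake blob data; `inertia_` is scikit-learn's name for the SSE used in the elbow method:

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

# fake cluster data: 200 points scattered around 4 centers
X, y_true = make_blobs(n_samples=200, centers=4, cluster_std=1.8, random_state=101)

kmeans = KMeans(n_clusters=4)
kmeans.fit(X)
print(kmeans.cluster_centers_)   # note the trailing underscores
print(kmeans.labels_)

# "elbow" method: compute the SSE for a range of K and look for the bend
sse = [KMeans(n_clusters=k).fit(X).inertia_ for k in range(1, 10)]
```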
Section 22. PCA
- PCA with Python
  - need to standardize the variables before conducting PCA:

    ```python
    from sklearn.preprocessing import StandardScaler
    scaler = StandardScaler()
    scaler.fit(df)
    scaled_df = scaler.transform(df)
    ```
  - load PCA: `from sklearn.decomposition import PCA`
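A fuller sketch using scikit-learn's built-in breast-cancer data (dataset choice and `n_components=2` are illustrative):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

cancer = load_breast_cancer()

scaled = StandardScaler().fit_transform(cancer["data"])  # standardize first

pca = PCA(n_components=2)              # keep the first 2 principal components
x_pca = pca.fit_transform(scaled)

print(x_pca.shape)                     # (569, 2): 30 features reduced to 2
print(pca.explained_variance_ratio_)   # variance captured by each component
```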
Section 23. Recommendation systems
- Content-based: uses the attributes of the items and recommends based on the similarity between them
- Collaborative filtering (CF): e.g. Amazon; based on knowledge of users' attitudes toward items, i.e. the "wisdom of the crowd"
  - more commonly used, produces better results
  - able to do feature learning on its own
- CF subtypes
- Memory-based collaborative filtering
- Model-based collaborative filtering: SVD
- Pandas: `df.corrwith(other)`: correlation between the DF's columns and another Series/DF (toy example below)
  - Seems the method only uses the correlation between ratings to check the similarity between 2 movies
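A toy version of the idea (ratings matrix and movie names are made up):

```python
import pandas as pd

# toy user-by-movie rating matrix; NaN = user hasn't rated that movie
ratings = pd.DataFrame(
    {"Movie A": [5, 4, None, 1],
     "Movie B": [4, 5, 2, None],
     "Movie C": [1, 2, 5, 4]},
    index=["u1", "u2", "u3", "u4"])

# similarity to "Movie A", judged purely by correlation of ratings
similar_to_a = ratings.corrwith(ratings["Movie A"])
print(similar_to_a.sort_values(ascending=False))
```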
Section 25. Big data and spark with python
- Local vs. distributed systems:
  - distributed means multiple machines connected in a network
- Hadoop: a way to distribute very large files across multiple machines
  - Hadoop Distributed File System (HDFS)
  - Hadoop also uses MapReduce, which allows computations on that data
  - HDFS uses blocks of data with a size of 128 MB by default; each of these blocks is replicated 3 times; the blocks are distributed in a way that supports fault tolerance
- MapReduce: a way of splitting a computation task across a distributed set of files (such as HDFS); it consists of a Job Tracker and multiple Task Trackers
- Spark can be thought of as a flexible alternative to MapReduce:
  - MapReduce requires files to be stored in HDFS; Spark doesn't
  - Spark can perform operations up to 100x faster than MapReduce
- Core idea of Spark: the resilient distributed dataset (RDD), with four main features:
  - Distributed collection of data
  - Fault-tolerant
  - Parallel operation - partitioned
  - Ability to use many data sources
- RDDs are immutable, lazily evaluated, and cacheable
- There are two types of RDD operations:
  - Transformations
    - `RDD.filter`
    - `RDD.map`: ~ `pd.apply()`
    - `RDD.flatMap`
  - Actions
    - First: return the first element of the RDD
    - Collect: return all the elements of the RDD
    - Count: return the number of elements
    - Take: return an array with the first n elements of the RDD
- `reduce()`: aggregate RDD elements using a function that returns a single element
- `reduceByKey()`: aggregate pair-RDD elements using a function that returns a pair RDD
  - similar to a groupby operation (sketch below)
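A minimal sketch (toy data; the output order of `reduceByKey` is not guaranteed):

```python
from pyspark import SparkContext

sc = SparkContext()

nums = sc.parallelize([1, 2, 3, 4, 5])
print(nums.reduce(lambda a, b: a + b))          # 15: collapsed to one element

pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])
# groupby-like: combine the values that share a key
print(pairs.reduceByKey(lambda a, b: a + b).collect())  # e.g. [('a', 4), ('b', 2)]
```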
- AWS EC2: virtual computer lab
- Log in to EC2 using SSH: `ssh -i xx.pem ubuntu@<public DNS>`
- PySpark setup
  - `source .bashrc`  # set to the Anaconda Python
- Intro. to Spark & Python
  - Notebook magic command: anything in the cell below the magic is written into "example.txt"

    ```
    %%writefile example.txt
    text ...
    ```
  - `from pyspark import SparkContext`, then `sc = SparkContext()`
    - `sc` has many different methods
- RDD Transformations & Actions
  - Transformation => returns an RDD object
  - Action => returns a local object (sketch below)
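A sketch chaining transformations lazily and then triggering them with actions (assumes the "example.txt" written by the `%%writefile` cell above):

```python
from pyspark import SparkContext

sc = SparkContext()
rdd = sc.textFile("example.txt")                 # the file written with %%writefile

words = rdd.flatMap(lambda line: line.split())   # transformation => new RDD
longer = words.filter(lambda w: len(w) > 3)      # transformation => new RDD
upper = longer.map(lambda w: w.upper())          # transformation, ~ pd.apply()

# nothing has executed yet (lazy evaluation); actions return local objects
print(upper.first())    # first element
print(upper.take(3))    # first 3 elements as a local list
print(upper.count())    # number of elements
print(upper.collect())  # every element pulled back to the driver
```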
Section 26. Neural Nets and Deep Learning
- Perceptron: "feed-forward" model, i.e. inputs are sent into the neuron, processed, and result in an output
- receive inputs
- weight inputs
- sum inputs
- generate output
- TensorFlow
  - Basic idea: create data flow graphs, which have nodes and edges. The array (data) passed along from one layer of nodes to the next is known as a Tensor
- Two ways to use TF:
- Customizable Graph Session
- Scikit-learn-type interface with `contrib.learn`
- TensorFlow basics
- Object/Data is called a "Tensor"
- `tf.Session()` => the `sess.run()` method
- Placeholder: inserts a placeholder for a tensor that will always be fed (sketch below)
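A minimal sketch, assuming the TensorFlow 1.x API the course uses (`tf.placeholder` and `tf.Session` were removed in TF 2.x):

```python
import tensorflow as tf  # TensorFlow 1.x

x = tf.placeholder(tf.float32)   # placeholders must always be fed at run time
y = tf.placeholder(tf.float32)
add = tf.add(x, y)               # a node in the data flow graph

with tf.Session() as sess:
    # the graph only computes when run inside a Session
    print(sess.run(add, feed_dict={x: 3.0, y: 4.0}))   # 7.0
```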
- TF estimators
- Steps
- Read in Data (normalize if necessary)
- Train/test split the data
- Create estimator feature columns
- Create the estimator input function
- Train estimator model
- Predict with new test input function
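A sketch of those steps with the TF 1.x estimator API (dataset, feature names, and hyperparameters are illustrative):

```python
import pandas as pd
import tensorflow as tf  # TensorFlow 1.x
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# 1-2. read in data, normalize, train/test split
cancer = load_breast_cancer()
X = pd.DataFrame(MinMaxScaler().fit_transform(cancer["data"]),
                 columns=["f%d" % i for i in range(30)])
y = pd.Series(cancer["target"])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

# 3. estimator feature columns
feat_cols = [tf.feature_column.numeric_column(c) for c in X.columns]

# 4-5. input function, then create and train the estimator model
input_fn = tf.estimator.inputs.pandas_input_fn(
    x=X_train, y=y_train, batch_size=10, num_epochs=5, shuffle=True)
model = tf.estimator.LinearClassifier(feature_columns=feat_cols)
model.train(input_fn=input_fn)

# 6. predict with a new (test) input function
pred_fn = tf.estimator.inputs.pandas_input_fn(x=X_test, shuffle=False)
predictions = list(model.predict(input_fn=pred_fn))
```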